Machine Learning Hw 2

Hw 2

Peter Sullivan
2022-03-13

ISLR Ch. 4, Exercise 16

Using the Boston data set, fit classification models in order to predict whether a given census tract has a crime rate above or below the median. Explore logistic regression, LDA, naive Bayes, and KNN models using various subsets of the predictors. Describe your findings. Hint: You will have to create the response variable yourself, using the variables that are contained in the Boston data set.

summary(Boston$crim)

    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 

 0.00632  0.08204  0.25651  3.61352  3.67708 88.97620 
Boston$response <- as.factor(ifelse(Boston$crim > median(Boston$crim),"Above","Below"))

Boston %>% filter(response=="Above") %>% head()

     crim zn indus chas   nox    rm  age    dis rad tax ptratio lstat

1 0.62976  0  8.14    0 0.538 5.949 61.8 4.7075   4 307      21  8.26

2 0.63796  0  8.14    0 0.538 6.096 84.5 4.4619   4 307      21 10.26

3 0.62739  0  8.14    0 0.538 5.834 56.5 4.4986   4 307      21  8.47

4 1.05393  0  8.14    0 0.538 5.935 29.3 4.4986   4 307      21  6.58

5 0.78420  0  8.14    0 0.538 5.990 81.7 4.2579   4 307      21 14.67

6 0.80271  0  8.14    0 0.538 5.456 36.6 3.7965   4 307      21 11.69

  medv response

1 20.4    Above

2 18.2    Above

3 19.9    Above

4 23.1    Above

5 17.5    Above

6 20.2    Above

Set up Training and Test Sets

set.seed(2)



train <- sample(1:nrow(Boston), round((nrow(Boston)/4)*3,0))



Boston.test <- Boston[-train,]

boston.response <- Boston$response[-train]

boston.train <- Boston[train,]

Logistic Regression

glm.fits <- glm(response ~ zn + indus+chas+nox+rm+age+dis+rad+tax, family = binomial, data = Boston, subset = train)

summary(glm.fits)



Call:

glm(formula = response ~ zn + indus + chas + nox + rm + age + 

    dis + rad + tax, family = binomial, data = Boston, subset = train)



Deviance Residuals: 

     Min        1Q    Median        3Q       Max  

-3.05155  -0.00772   0.00850   0.28098   1.80684  



Coefficients:

              Estimate Std. Error z value Pr(>|z|)    

(Intercept)  24.861723   4.752476   5.231 1.68e-07 ***

zn            0.061220   0.031442   1.947 0.051524 .  

indus         0.008342   0.047150   0.177 0.859563    

chas         -0.125309   0.847579  -0.148 0.882467    

nox         -34.478533   6.935623  -4.971 6.65e-07 ***

rm           -0.515107   0.353771  -1.456 0.145381    

age          -0.011939   0.009914  -1.204 0.228468    

dis          -0.338121   0.202467  -1.670 0.094918 .  

rad          -0.594197   0.153996  -3.859 0.000114 ***

tax           0.005827   0.002710   2.151 0.031508 *  

---

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1



(Dispersion parameter for binomial family taken to be 1)



    Null deviance: 526.53  on 379  degrees of freedom

Residual deviance: 175.75  on 370  degrees of freedom

AIC: 195.75



Number of Fisher Scoring iterations: 9
#coef(glm.fits)

#summary(glm.fits)$coef





glm.probs <-predict(glm.fits, type = "response", Boston.test)

#glm.probs %>% head()



contrasts(Boston$response)

      Below

Above     0

Below     1
#length(glm.probs)

#dim(Boston)

glm.pred <- rep("Above",126)

glm.pred[glm.probs >.5 ] = "Below"

#Boston$response %>% length()

table(glm.pred,boston.response)

        boston.response

glm.pred Above Below

   Above    55     3

   Below    13    55
mean(glm.pred==boston.response)

[1] 0.8730159
acc <- mean(glm.pred==boston.response)



Prediction_Accuracy <- tibble("Model" = "Logistic Regression", "Accuracy" = acc)

This means we were correct 87% of the time using the logistic analysis in predicting whether the crime level would be above the median using the variables chosen.

LDA (Linear Discriminant Analysis)

library(MASS)



lda.fit <- lda(response ~ zn + indus+chas+nox+rm+age+dis+rad+tax, data = Boston)

#summary(lda.fit)

lda.fit

Call:

lda(response ~ zn + indus + chas + nox + rm + age + dis + rad + 

    tax, data = Boston)



Prior probabilities of groups:

Above Below 

  0.5   0.5 



Group means:

             zn     indus       chas       nox       rm      age

Above  1.201581 15.271265 0.08695652 0.6384190 6.174874 85.83953

Below 21.525692  7.002292 0.05138340 0.4709711 6.394395 51.31028

           dis       rad      tax

Above 2.498489 14.940711 510.7312

Below 5.091596  4.158103 305.7431



Coefficients of linear discriminants:

               LD1

zn     0.004312445

indus -0.014204281

chas  -0.029319347

nox   -7.380732435

rm    -0.253144266

age   -0.010855144

dis    0.010054637

rad   -0.084063067

tax    0.001263545
lda.pred <- predict(lda.fit,Boston)

lda.class <- lda.pred$class



table(lda.class,Boston$response)

         

lda.class Above Below

    Above   194    14

    Below    59   239
mean(lda.class==Boston$response)

[1] 0.8557312
acc <- mean(lda.class==Boston$response)



Prediction_Accuracy <- rbind(Prediction_Accuracy,  c("LDA", acc))

For LDA, 86% of predictions were correct.

Naive Bayes

library(e1071)



nb.fit <- naiveBayes(response ~ zn + indus+chas+nox+rm+age+dis+rad+tax, family = binomial, data = Boston, subset = train)

#nb.fit



nb.class <- predict(nb.fit, Boston.test)

table(nb.class, boston.response)

        boston.response

nb.class Above Below

   Above    52    11

   Below    16    47
mean(nb.class == boston.response)

[1] 0.7857143
acc <- mean(lda.class==Boston$response)



Prediction_Accuracy <- rbind(Prediction_Accuracy, c("Naive Bayes", acc))

Using Naive Bayes, we were able to get a prediction accuracy of 78%.

K nearest Neighbors

library(class)

attach(Boston)

set.seed(2)

#Boston %>% dim()

train.x <- Boston[train,colnames(Boston) %in% c("zn","indus","chas","nox","rm","age","dis","rad","tax")]



test.x <- Boston[-train,colnames(Boston) %in% c("zn","indus","chas","nox","rm","age","dis","rad","tax")]

train.response <- Boston$response[train]

test.response <- Boston$response[-train]

#train.x %>% dim()

#test.x %>% dim()

#train.response %>% length()



knn.pred <- knn(train.x,test.x, train.response, k =1)

#knn.pred %>% length()

#test.response %>% length()

#knn.pred %>% length()

#test.response %>% length()



table(knn.pred,test.response)

        test.response

knn.pred Above Below

   Above    66     7

   Below     2    51
paste("Percent Accuracy with K = 1",mean(knn.pred == test.response))

[1] "Percent Accuracy with K = 1 0.928571428571429"
knn.pred <- knn(train.x,test.x, train.response, k =3)

paste("Percent Accuracy with K =3",mean(knn.pred == test.response))

[1] "Percent Accuracy with K =3 0.968253968253968"
knn.pred <- knn(train.x,test.x, train.response, k =5)

paste("Percent Accuracy with K = 5",mean(knn.pred == test.response))

[1] "Percent Accuracy with K = 5 0.952380952380952"
x<- NULL

y <- NULL

for(i in 1:100){

  knn.pred <- knn(train.x,test.x, train.response, k =i)

  x<- rbind(x,mean(knn.pred == test.response))

  y <- rbind(y,paste(i))

}

data.frame("Kvalue" = y, "Accuracy" = x) %>% arrange(desc(x)) %>% head() %>% kable()

Kvalue Accuracy
3 0.9682540
4 0.9682540
6 0.9603175
5 0.9523810
1 0.9285714
7 0.9285714
knn.pred <- knn(train.x,test.x, train.response, k =3)

acc <- mean(knn.pred == test.response)



Prediction_Accuracy <- rbind(Prediction_Accuracy, c("KNN", acc))

In order to set up the KNN model, I needed to create 4 data sets. A training and Testing data set, and a test response, and training response. I then decided to run the model and use various values of K. The first bit of code is a bit inefficient in deciding the best value of K. This can be seen in the first 3 predictions for K = 1, 3, and 5.

In order to find the best value of K, I decided to create a loop that would try every K value from 1 to 100 and store the models % accuracy in a vector named x. I then created a dataframe to visualize the x values and the corresponding K values used.

Using KNN, I created 100 models with a k value form 1 to 100. The top 6 are listed above. The K value of 3 was the best value for K.

Prediction_Accuracy %>% arrange(desc(Accuracy)) %>% kable()

Model Accuracy
KNN 0.968253968253968
Logistic Regression 0.873015873015873
LDA 0.855731225296443
Naive Bayes 0.855731225296443

I would recommend using the KNN to predict whether a crime level will be higher than the median for the Boston data set. It has the highest accuracy at 97%.


ISLR Ch. 5, Exercise 3 (or substitute advanced: Exercise 2)

A.

K fold is implemented by dividing the set of cases into a number of groups which is dictated by K. Each group will have an equal length.

The first group is the validation set. The model is trained on the other groups. So if we had a value of 10 for K. Group 1 would be the validation set and the 9 other groups would be the training groups. A Mean squared error is calculated for the group. This process will be carried out for each group. At the end we will have a MSE for each group. Finally, we take the average of all the MSE’s and that gives us the k fold CV estimate.

B.

The Validation Set Approach:

If K is set to low, then it approaches the same as using the validation set approach. Test Error rate will most likely be higher in the validation approach when compared to kfold validation.

Validation set approach will most likely have a high bias and cause over fitting when compared to the k fold approach.

LOOCV:

Computation advantage when using kfold versus LOOCV. K fold will require less computation/resources/time, then compared to the LOOCV.

LOOCV has less bias, which means less overfitting when applied to other datasets, but will have more variance when compared to K fold and when compared to the validation set approach.


ISLR Ch. 5, Exercise 5

set.seed(5)



Data <- Default

#Data %>% head()

#Data %>% names()



train <-  sample(1:nrow(Data), round((nrow(Data)/4)*3,0))

#train %>% length()

#Data %>% dim()





### Create training Sets

Data.test <- Data[-train,]

Data.response <- Data$default[-train]





# Fit model

glm.fits <- glm(default ~ income+balance, family = binomial, data = Data, subset = train)

#summary(glm.fits)

#coef(glm.fits)

#summary(glm.fits)$coef

glm.probs <-predict(glm.fits, type = "response", Data.test)





contrasts(Data$default)

    Yes

No    0

Yes   1
#length(glm.probs)

#dim(Data.test)

glm.pred <- rep("No",2500)

glm.pred[glm.probs >.5 ] = "Yes"



table(glm.pred,Data.response)

        Data.response

glm.pred   No  Yes

     No  2417   48

     Yes    7   28
mean(glm.pred==Data.response)

[1] 0.978
paste0("The test Error Rate is : ",(1-(mean(glm.pred==Data.response)))*100,"%" )

[1] "The test Error Rate is : 2.2%"


For this problem, we will run the sample function 3 times to get a different training set. I will fit the model, and then I will calculate the test error rate again. For ease, I will create a loop. I create a new seed each time I run the loop.

for(i in 1:3){

set.seed(i)

train <-  sample(1:nrow(Data), round((nrow(Data)/4)*3,0))

# Create training Sets

Data.test <- Data[-train,]

Data.response <- Data$default[-train]

# Fit model

glm.fits <- glm(default ~ income+balance, family = binomial, data = Data, subset = train)

glm.probs <-predict(glm.fits, type = "response", Data.test)

glm.pred <- rep("No",2500)

glm.pred[glm.probs >.5 ] = "Yes"

mean(glm.pred==Data.response)

print(paste0("The test Error Rate is : ",(1-(mean(glm.pred==Data.response)))*100,"%" ))

}

[1] "The test Error Rate is : 2.6%"

[1] "The test Error Rate is : 2.12%"

[1] "The test Error Rate is : 2.6%"

When Running the model with 3 different seeds. I saw that the error rate fluctuated from 2.12 % up to 2.6%.

glm.fits <- glm(default ~ ., family = binomial, data = Data, subset = train)

glm.probs <-predict(glm.fits, type = "response", Data.test)

glm.pred <- rep("No",2500)

glm.pred[glm.probs >.5 ] = "Yes"



table(glm.pred,Data.response)

        Data.response

glm.pred   No  Yes

     No  2416   54

     Yes   13   17
mean(glm.pred==Data.response)

[1] 0.9732
paste0("The test Error Rate is : ",(1-(mean(glm.pred==Data.response)))*100,"%" )

[1] "The test Error Rate is : 2.68%"

Using the dummy variable is about the same as the last error rate calculated in part c (2.68). I’m comparing it to that result, since the current sample seed is the one used in that loop.


ISLR Ch. 5, Exercise 9

u_ <- Boston$medv %>% mean()

paste("The population mean of the medv:",u_)

[1] "The population mean of the medv: 22.5328063241107"

sd_error <- Boston$medv%>%sd()/(length(Boston)^(1/2))



paste("The Standard error of the sample mean is:", sd_error)

[1] "The Standard error of the sample mean is: 2.45802946039096"
library(boot)



alpha.fn <- function(data, index){

  medv <-data$medv[index]

  mean_1 <- mean(medv)

 

}

#Boston$medv %>% length()





boot(Boston, alpha.fn,10000)



ORDINARY NONPARAMETRIC BOOTSTRAP





Call:

boot(data = Boston, statistic = alpha.fn, R = 10000)





Bootstrap Statistics :

    original       bias    std. error

t1* 22.53281 0.0006395257    0.407591

The standard error is lower using the Bootstrap when compared to the answer from part c.

t.test(Boston$medv)



    One Sample t-test



data:  Boston$medv

t = 55.111, df = 505, p-value < 2.2e-16

alternative hypothesis: true mean is not equal to 0

95 percent confidence interval:

 21.72953 23.33608

sample estimates:

mean of x 

 22.53281 
mean_boot <- 22.53281

std.boot <- .4137494



upperbound <- mean_boot + 2*std.boot

lowerbound <- mean_boot -2*std.boot



paste0("The confidence interval using the boot: [",lowerbound," , ",upperbound,"]")

[1] "The confidence interval using the boot: [21.7053112 , 23.3603088]"


The boot and the t test confidence intervals are very close, almost identical.


paste("The median value of medv in the population", Boston$medv %>% median())

[1] "The median value of medv in the population 21.2"

alpha.median <- function(data, index){

  medv <-data$medv[index]

  median_1 <- median(medv)



}



boot(Boston, alpha.median, 10000)



ORDINARY NONPARAMETRIC BOOTSTRAP





Call:

boot(data = Boston, statistic = alpha.median, R = 10000)





Bootstrap Statistics :

    original    bias    std. error

t1*     21.2 -0.013115   0.3808402

The standard error of the median is .3768. This standard error is very small for the original median of 21.2.

quantile(Boston$medv, probs = c(.1,.2,.5,.75,.9))

  10%   20%   50%   75%   90% 

12.75 15.30 21.20 25.00 34.80 
paste("The 10th percentile is :", quantile(Boston$medv, probs = .1),"%")

[1] "The 10th percentile is : 12.75 %"

alpha.quantile <- function(data, index){

  medv <-data$medv[index]

  perc_10 <- quantile(medv, probs = .1)



}



boot(Boston, alpha.quantile, 10000)



ORDINARY NONPARAMETRIC BOOTSTRAP





Call:

boot(data = Boston, statistic = alpha.quantile, R = 10000)





Bootstrap Statistics :

    original   bias    std. error

t1*    12.75 0.007025   0.4981081
detach(Boston)

The standard error for the 10% quantile is .5 for a original value of 12.75.


ISLR Ch. 6, Exercise 2

Referenced Article: https://towardsdatascience.com/understanding-the-bias-variance-tradeoff-165e6942b229

I found this article helpful in understanding accuracy with reference to bias/variance tradeoff.

Figure 2.7 in Textbook also helped with flexibility of different models.

A.

I

More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

Incorrect. Lasso is less flexible than Least squares.

II

More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

Incorrect - less flexible than OLS

III

Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

Correct - Low bias and high variance will cause overfitting, but will have a higher prediction accuracy than underfitting the data. since the variance is not increasing, we will have more coefficents that explain the response being measuared, and lamda will not go to zero for those predictors.

IV

Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

Incorrect - Increase in variance < then decrease in bias would give a situation of high variance and high bias.

Also this is not how the lasso works, lambda will get rid of the variables that have high variance, by setting their cofficents to

B

Repeat (a) for ridge regression relative to least squares.

i

Incorrect - Less Flexible

II

Incorrect - Less Flexible

III

Correct - As lamda increases bias, flexibility decreases, which means we get a decreased variance and increased bias.

Slight increase in bias with less variance increases prediction accuracy.

IV

Incorrect. Larger increase in varriance associated with smaller decrease in bias would not improve prediction accuracy.

C

Repeat (a) for non-linear methods relative to least squares.

i

Correct - More flexible model, Slight increase in bias with less variance increases prediction accuracy.

ii-iV

Incorrect.


ISLR Ch. 6, Exercise 9

(omit e & f) (requires time and effort; please collaborate & use Piazza)

In this exercise, we will predict the number of applications received using the other variables in the College data set.

A

Split the data set into a training set and a test set.

library(ISLR2)



set.seed(1)



#College %>% dim()

#College %>% names()

#College %>% head()

#?College



train <-  sample(1:nrow(College), round((nrow(College)/4)*3,0))

#train %>% length()

#train %>% head()







# Create Test actuals and training Sets



College.test <- College[-train,]

College.response <- College$Apps[-train]

college.train <- College[train,]

After looking into College data set, I identified that our response variable will be the column Apps. The training set will be the college dataframe filtered on the train index.

B

Fit a linear model using least squares on the training set, and report the test error obtained.

attach(College)

lm.fit <- lm(Apps ~., data = College, subset = train)

lm.pred <- predict(lm.fit, College.test, type = "response")

lm_mse <- mean((College.response -lm.pred )^2)

lm_mse

[1] 1389123
MSE_dataframe <- data.frame("model" = "Least Square", "MSE" = lm_mse)

C

Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.

library(glmnet)





x <- model.matrix(Apps~.,College)[,-1]

y <- College$Apps

grid <- 10^seq(10,-2, length = 100)



ridge.mod <- glmnet(x[train,], y[train], alpha =0, lambda = grid )

cv.out <- cv.glmnet(x[train,], y[train],alpha = 0)

plot(cv.out)

lamda_best <- cv.out$lambda.min

lamda_best

[1] 364.3384
ridge.pred <- predict(ridge.mod, s = lamda_best, newx = x[-train,])

ridge_mse <- mean((ridge.pred - y[-train])^2)

MSE_dataframe <- rbind(MSE_dataframe, c("ridge",ridge_mse))



paste("The test Error Rate is : ", mean((ridge.pred - y[-train])^2))

[1] "The test Error Rate is :  1204431.39092091"

D

Fit a lasso model on the training set, with λ chosen by crossvalidation. Report the test error obtained, along with the number of non-zero coefficient estimates.

lasso.mod <- glmnet(x[train,],y[train],alpha = 1, lambda = grid)

cv.out <- cv.glmnet(x[train,], y[train], alpha = 1)

plot(cv.out)

lamda_best <- cv.out$lambda.min

lamda_best

[1] 2.133937
lasso.pred <- predict(lasso.mod, s= lamda_best, newx = x[-train,])

paste("The test error rate is ",mean((lasso.pred - y[-train])^2))

[1] "The test error rate is  1374692.34765744"
lasso_mse <- mean((lasso.pred - y[-train])^2)



out <- glmnet(x,y, alpha = 1 ,lambda = grid)

lamda.coef <- predict(out, type = "coefficients", s= lamda_best)[1:18,]

paste("The number of non-zero coeficients:",lamda.coef[lamda.coef !=0] %>% length())

[1] "The number of non-zero coeficients: 18"
MSE_dataframe <- rbind(MSE_dataframe, c("Lasso",lasso_mse))

None of the coefficients are zero.

E

Fit a PCR model on the training set, with M chosen by crossvalidation. Report the test error obtained, along with the value of M selected by cross-validation.

library(pls)

set.seed(10)

pcr.fit <- pcr(Apps ~ . , data = College, subset = train, scale = TRUE, validation = "CV")

validationplot(pcr.fit, val.type = "MSEP")

pcr.pred <- predict(pcr.fit,College.test, ncomp = 17)

pcr_mse<- mean((pcr.pred - College.response)^2)

paste("The MSE error is ", pcr_mse)

[1] "The MSE error is  1389123.27022729"
MSE_dataframe <- rbind(MSE_dataframe, c("PCR",pcr_mse))

From the validation plot, the lowest MSEP for the number of components is around 16,17, and 18.I decided to use 17 as my value of M.

F

Fit a PLS model on the training set, with M chosen by crossvalidation. Report the test error obtained, along with the value of M selected by cross-validation.

pls.fit <- plsr(Apps~., data = College, subset = train, scale = TRUE, validation = "CV")



validationplot(pls.fit, val.type = "MSEP")

pls.pred_7<- predict(pls.fit, College.test, ncomp = 7)

pls.pred_8 <- predict(pls.fit, College.test, ncomp = 8)



paste("The MSE error using M as 7: ", mean((pls.pred_7- College.response)^2))

[1] "The MSE error using M as 7:  1336641.93400001"
paste("The MSE error using M as 8: ", mean((pls.pred_8- College.response)^2))

[1] "The MSE error using M as 8:  1362535.73332994"
pls_mse <- mean((pls.pred_8- College.response)^2)

MSE_dataframe <- rbind(MSE_dataframe, c("PLS",pls_mse))

The number of components ranges from 7 to 8. I will predict using both values of M.

G

Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?

MSE_dataframe %>% arrange(MSE) %>% kable()

model MSE
ridge 1204431.39092091
PLS 1362535.73332994
Lasso 1374692.34765744
Least Square 1389123.27022729
PCR 1389123.27022729
MSE_dataframe$MSE <- as.numeric(MSE_dataframe$MSE)



MSE_dataframe %>% ggplot2::ggplot(aes(x = model, y = MSE)) + geom_col(fill = "deepskyblue3")+ theme(axis.text.y = element_blank())+labs(title = "MSE per Model")+geom_text(aes(label = round(as.numeric(MSE),0)), vjust = 2)

The Ridge regression model had the lowest MSE, with PLS coming up the second lowest MSE, following Lasso and the PCR and Least square in last. There is not too much of a difference between the MSE’s, it looks larger due to the scale on the graph.


ISLR Ch. 8, Exercise 4

A.

B


ISLR Ch. 8, Exercise 7

library(randomForest); library(ggplot2)

set.seed(1)

train <- sample (1: nrow (Boston), (nrow (Boston) / 4)*3)

boston.response_test <- Boston$medv[-train]

df <- tibble("predictors" = c(),"ntree" = double(), "MSE" = double())





for(i in seq(1, 13,2)){

  for(x in seq(1, 600,25)){

  bag.boston <- randomForest(medv ~., data = Boston, subset = train, mtry = i,ntree = x)

yhat.bag <- predict(bag.boston, newdata = Boston[-train,])

mse_1 <- (mean((yhat.bag- boston.response_test)^2))

df <-df %>% add_row(predictors = as.character(i),ntree = x, MSE = mse_1)

  }

}



plot<- df %>% ggplot()+geom_line(aes(x = ntree, y = MSE, group = predictors,color = predictors, linetype = predictors))



plotly::ggplotly(plot)

When starting off with less trees, we see that the MSE’s for model is higher. As we increase trees, the MSE drops significantly for all Models. The model that only uses one predictor seems to have the highest MSE ranging from 30 - 25. The models with 3, 5 and 7 seem to have the lowest amount of MSE. Theres no need to have more than 100 Ntrees to get the lowest MSE.


ISLR Ch. 8, Exercise 9

A


Create a training set containing a random sample of 800 observations, and a test set containing the remaining observations.

## OJ dataset

library(ISLR2)

set.seed(1)



train <- sample(1:nrow(OJ), 800)

#train %>% length()

OJ.test <- OJ[-train,]

OJ.response <- OJ$Purchase[-train]

OJ.train <- OJ[train,]

OJ.train.response <- OJ$Purchase[train]





#OJ.test %>% dim()

#OJ.train %>% dim()

#OJ %>% dim()

#OJ.response %>% length()

#OJ.train.response %>% length()

B

Fit a tree to the training data, with Purchase as the response and the other variables as predictors. Use the summary() function to produce summary statistics about the tree, and describe the results obtained. What is the training error rate? How many terminal nodes does the tree have?

library(tree)

attach(OJ)

tree.oj <- tree(Purchase ~.,data = OJ, subset = train)

summary(tree.oj)



Classification tree:

tree(formula = Purchase ~ ., data = OJ, subset = train)

Variables actually used in tree construction:

[1] "LoyalCH"       "PriceDiff"     "SpecialCH"     "ListPriceDiff"

[5] "PctDiscMM"    

Number of terminal nodes:  9 

Residual mean deviance:  0.7432 = 587.8 / 791 

Misclassification error rate: 0.1588 = 127 / 800 

The training error rate is 16% There are 8 terminal nodes for this tree. The residual mean deviance is 75%.

C

Type in the name of the tree object in order to get a detailed text output. Pick one of the terminal nodes, and interpret the information displayed.

tree.oj

node), split, n, deviance, yval, (yprob)

      * denotes terminal node



 1) root 800 1073.00 CH ( 0.60625 0.39375 )  

   2) LoyalCH < 0.5036 365  441.60 MM ( 0.29315 0.70685 )  

     4) LoyalCH < 0.280875 177  140.50 MM ( 0.13559 0.86441 )  

       8) LoyalCH < 0.0356415 59   10.14 MM ( 0.01695 0.98305 ) *

       9) LoyalCH > 0.0356415 118  116.40 MM ( 0.19492 0.80508 ) *

     5) LoyalCH > 0.280875 188  258.00 MM ( 0.44149 0.55851 )  

      10) PriceDiff < 0.05 79   84.79 MM ( 0.22785 0.77215 )  

        20) SpecialCH < 0.5 64   51.98 MM ( 0.14062 0.85938 ) *

        21) SpecialCH > 0.5 15   20.19 CH ( 0.60000 0.40000 ) *

      11) PriceDiff > 0.05 109  147.00 CH ( 0.59633 0.40367 ) *

   3) LoyalCH > 0.5036 435  337.90 CH ( 0.86897 0.13103 )  

     6) LoyalCH < 0.764572 174  201.00 CH ( 0.73563 0.26437 )  

      12) ListPriceDiff < 0.235 72   99.81 MM ( 0.50000 0.50000 )  

        24) PctDiscMM < 0.196197 55   73.14 CH ( 0.61818 0.38182 ) *

        25) PctDiscMM > 0.196197 17   12.32 MM ( 0.11765 0.88235 ) *

      13) ListPriceDiff > 0.235 102   65.43 CH ( 0.90196 0.09804 ) *

     7) LoyalCH > 0.764572 261   91.20 CH ( 0.95785 0.04215 ) *

The terminal nodes are shown by the asterix at the end. Lets look at branch 7. This ends in a terminal node. The split is LoyaCH > .764572. There are 123.80 observations in this branch. The prediction for this branch is CH.

D

Create a plot of the tree, and interpret the results.

plot(tree.oj)

text(tree.oj, pretty = 0)

The criteria most used are loyal CH and price diff.

e

Predict the response on the test data, and produce a confusion matrix comparing the test labels to the predicted test labels. What is the test error rate?

tree.pred <- predict(tree.oj, OJ.test, type = "class")

x <- table(tree.pred, OJ.response)





table(tree.pred,OJ.response) %>% kable()

CH MM
CH 160 38
MM 8 64
paste("The Test error rate is :", round(1-(x[1,1]+x[2,2])/sum(x),4),"%")

[1] "The Test error rate is : 0.1704 %"

F, G, H

Apply the cv.tree() function to the training set in order to determine the optimal tree size.

Produce a plot with tree size on the x-axis and cross-validated classification error rate on the y-axis.

Which tree size corresponds to the lowest cross-validated classification error rate?

cv.oj <- cv.tree(tree.oj, FUN = prune.misclass )





par(mfrow = c(1,2))

plot(cv.oj$size, cv.oj$dev, type = "b", xlab = "Number of Trees", ylab = "Crossvalidation Error")

plot(cv.oj$k, cv.oj$dev, type = "b", xlab = "Cross complexity Parameter", ylab = "Crossvalidation Error")

According to the figure above, the Crossvalidation error was the lowest with the number of trees at 8.

I

Produce a pruned tree corresponding to the optimal tree size obtained using cross-validation. If cross-validation does not lead to selection of a pruned tree, then create a pruned tree with five terminal nodes.

prune.oj <- prune.misclass(tree.oj, best = 8)

plot(prune.oj)

text(prune.oj, pretty = 0)

J

Compare the training error rates between the pruned and unpruned trees. Which is higher?

prune.oj.pred <- predict(prune.oj, OJ.train, type = "class")

prune.oj.pred %>% length()

[1] 800
OJ.train.response %>% length()

[1] 800
x<- table(prune.oj.pred,OJ.train.response)







 table(prune.oj.pred,OJ.train.response)%>% kable()

CH MM
CH 450 92
MM 35 223
paste("The Test error rate is :", round(1-(x[1,1]+x[2,2])/sum(x),3),"%")

[1] "The Test error rate is : 0.159 %"

The training error rate is slightly less at 15% compared to the 16 % shown in the original training set.

K

Compare the test error rates between the pruned and unpruned trees. Which is higher?

prune.test.pred <- predict(prune.oj, OJ.test, type = "class")



x<- table(prune.test.pred,OJ.response)







 table(prune.test.pred,OJ.response)%>% kable()

CH MM
CH 160 38
MM 8 64
paste("The Test error rate is :", round(1-(x[1,1]+x[2,2])/sum(x),4),"%")

[1] "The Test error rate is : 0.1704 %"

The test error from the pruned tree is the same as the unpruned tree. We only removed one branch from the original 9 and the CV error rates looked pretty similar for both.


ISLR Ch. 8, Exercise 10

We now use boosting to predict Salary in the Hitters data set.

A

Remove the observations for whom the salary information is unknown, and then log-transform the salaries.

Hitters %>% filter(is.na(Hitters$Salary)) %>% count()

   n

1 59
Hitters %>% dim()

[1] 322  20
Hitters_new <- Hitters %>% filter(!is.na(Hitters$Salary))

Hitters_new %>% dim()

[1] 263  20

Using !is.na(), I filtered the dataset to filter out all rows with an NA salary.

Next we will log transform the Salaries

Hitters_new$Salary <- log(Hitters_new$Salary)

B

Create a training set consisting of the first 200 observations, and a test set consisting of the remaining observations.

set.seed(10)



train <- sample(1:nrow(Hitters_new), 200)

#train %>% length()



Hitters.test <- Hitters_new[-train,]

Hitters.response <- Hitters_new$Salary[-train]

Hitters.train <- Hitters_new[train,]

Hitters.train.response <- Hitters_new$Salary[train]





#Hitters.test %>% dim()

#Hitters_new %>% dim()

#Hitters.train.response%>% length()

#Hitters.response %>% length()

C

Perform boosting on the training set with 1,000 trees for a range of values of the shrinkage parameter λ. Produce a plot with different shrinkage values on the x-axis and the corresponding training set MSE on the y-axis.

library(gbm)



## Example Boost function, summary, and way to calculate MSE

boost.Hitters <- gbm(Salary~.,data = Hitters_new[train,],

                     distribution = "gaussian", n.trees = 1000,

                     interaction.depth = 4, shrinkage = .001)



summary(boost.Hitters)

                var     rel.inf

CAtBat       CAtBat 28.48705371

CRuns         CRuns 26.44960088

CHits         CHits 11.53406819

CRBI           CRBI  9.37181852

CWalks       CWalks  6.10917041

Hits           Hits  3.05905465

CHmRun       CHmRun  2.19862661

Walks         Walks  2.17115968

HmRun         HmRun  2.01444003

AtBat         AtBat  1.94060681

Years         Years  1.86440763

RBI             RBI  1.73382350

PutOuts     PutOuts  1.62680896

Runs           Runs  0.89988120

Errors       Errors  0.16759670

Assists     Assists  0.12090202

Division   Division  0.11322110

NewLeague NewLeague  0.08105077

League       League  0.05670863
yhat.boost <- predict(boost.Hitters, newdata = Hitters_new[-train,], n.trees = 1000)

yhat.boost %>% length()

[1] 63
paste("The MSE test error rate is",mean((yhat.boost- Hitters.response)^2))

[1] "The MSE test error rate is 0.420831510594534"

Using the method above, I can now calculate the MSE using the boost model, yhat predictions and the actual results on the training set. Next I will create a loop that will run through multiple values of shrinage and add the calculated MSE’s to a dataframe. I will then plot those values.

boost_df <- tibble("Shrinkage" = double(), "MSE" = double())





for(i in seq(.0001, .25,.01)){

  

  boost.Hitters <- gbm(Salary~.,data = Hitters_new[train,],

                     distribution = "gaussian", n.trees = 1000,

                     interaction.depth = 4, shrinkage = i)

  yhat.boost <- predict(boost.Hitters, newdata = Hitters_new[train,], n.trees = 1000)

mse_1 <- mean((yhat.boost- Hitters.train.response)^2)





boost_df <- boost_df %>% add_row(Shrinkage = i, MSE = mse_1)

  

}



















plot<- boost_df %>% ggplot()+geom_line(aes(x = Shrinkage, y = MSE))+xlab("Shrinkage (lambda)")



plotly::ggplotly(plot)

D

Produce a plot with different shrinkage values on the x-axis and the corresponding test set MSE on the y-axis.

For this, I will do the exact steps as done in part C, but I will use the test set instead of the training set when making predicitons.

boost_df_test <- tibble("Shrinkage" = double(), "MSE" = double())





for(i in seq(.0001, .25,.01)){

  

  boost.Hitters <- gbm(Salary~.,data = Hitters_new[train,],

                     distribution = "gaussian", n.trees = 1000,

                     interaction.depth = 4, shrinkage = i)

  yhat.boost <- predict(boost.Hitters, newdata = Hitters_new[-train,], n.trees = 1000)

mse_1 <- mean((yhat.boost- Hitters.test$Salary)^2)



boost_df_test <- boost_df_test %>% add_row(Shrinkage = i, MSE = mse_1)

  

}





plot<- boost_df_test %>% ggplot()+geom_line(aes(x = Shrinkage, y = MSE))+xlab("Shrinkage (lambda)")



plotly::ggplotly(plot)

boost_df_test %>% summarise(min(MSE))

# A tibble: 1 x 1

  `min(MSE)`

       <dbl>

1      0.377
which.min(boost_df_test$MSE)

[1] 2
boost_df_test[which.min(boost_df_test$MSE),]

# A tibble: 1 x 2

  Shrinkage   MSE

      <dbl> <dbl>

1    0.0101 0.377

Min MSE can be found at .0101 Lamda.

E

Compare the test MSE of boosting to the test MSE that results from applying two of the regression approaches seen in Chapters 3 and 6.

For this I decided to use multiple models from ch. 3 and ch. 6. See below:

library(ggplot2)



## Linear Model



lm.fit <- lm(Salary ~., data = Hitters_new, subset = train)

lm.pred <- predict(lm.fit, Hitters_new[-train,], type = "response")

lm_mse <- mean((lm.pred - Hitters.response)^2)

paste("The Linear Model MSE is: ",lm_mse)

[1] "The Linear Model MSE is:  0.576163818320103"
MSE_dataframe <- data.frame("model" = "Least Square", "MSE" = lm_mse)







## Boost

 boost.Hitters <- gbm(Salary~.,data = Hitters_new[train,],

                     distribution = "gaussian", n.trees = 1000,

                     interaction.depth = 4, shrinkage = .0101)

  yhat.boost <- predict(boost.Hitters, newdata = Hitters_new[-train,], n.trees = 1000)

mse_1 <- mean((yhat.boost- Hitters.response)^2)





MSE_dataframe <- rbind(MSE_dataframe, c("Boosting",mse_1))



paste("The test Error Rate for Boosting is : ", mean((yhat.boost - Hitters.response)^2))

[1] "The test Error Rate for Boosting is :  0.382735329466136"
## Ridge

set.seed(10)

x <- model.matrix(Salary~.,Hitters_new)[,-1]

y <- Hitters_new$Salary

grid <- 10^seq(10,-2, length = 100)



ridge.mod <- glmnet(x[train,], y[train], alpha =0, lambda = grid )

cv.out <- cv.glmnet(x[train,], y[train],alpha = 0)

lamda_best <- cv.out$lambda.min

ridge.pred <- predict(ridge.mod, s = lamda_best, newx = x[-train,])

ridge_mse <- mean((ridge.pred - y[-train])^2)

MSE_dataframe <- rbind(MSE_dataframe, c("ridge",ridge_mse))



paste("The test Error Rate for Ridge is : ", mean((ridge.pred - y[-train])^2))

[1] "The test Error Rate for Ridge is :  0.581396338527786"
## Lasso



lasso.mod <- glmnet(x[train,],y[train],alpha = 1, lambda = grid)

cv.out <- cv.glmnet(x[train,], y[train], alpha = 1)

lamda_best <- cv.out$lambda.min

lasso.pred <- predict(lasso.mod, s= lamda_best, newx = x[-train,])

paste("The test error rate for lasso ",mean((lasso.pred - y[-train])^2))

[1] "The test error rate for lasso  0.575921346257875"
lasso_mse <- mean((lasso.pred - y[-train])^2)

MSE_dataframe <- rbind(MSE_dataframe, c("Lasso",lasso_mse))







## PCR



pcr.fit <- pcr(Salary ~ . , data = Hitters_new, subset = train, scale = TRUE, validation = "CV")

validationplot(pcr.fit,val.type = "MSEP")

pcr.pred <- predict(pcr.fit,Hitters.test, ncomp = 10)

pcr_mse<- mean((pcr.pred - Hitters.response)^2)

# lowest MSEP at 10 componenets.

paste("The MSE error for PCR is ", pcr_mse)

[1] "The MSE error for PCR is  0.619877318874639"
MSE_dataframe <- rbind(MSE_dataframe, c("PCR",pcr_mse))







## PLS



pls.fit <- plsr(Salary~., data = Hitters_new, subset = train, scale = TRUE, validation = "CV")

validationplot(pls.fit, val.type = "MSEP")

# Lowest  MSEP at 5 components

pls.pred_7<- predict(pls.fit, Hitters.test, ncomp = 5)

paste("The MSE error using M as 5: ", mean((pls.pred_7- Hitters.response)^2))

[1] "The MSE error using M as 5:  0.617966162736438"
pls_mse <- mean((pls.pred_7- Hitters.response)^2)

MSE_dataframe <- rbind(MSE_dataframe, c("PLS",pls_mse))









#MSE_dataframe %>% arrange(MSE)

MSE_dataframe$MSE <- as.numeric(MSE_dataframe$MSE)



MSE_dataframe %>% ggplot2::ggplot(aes(x = model, y = MSE)) + geom_col(fill = "deepskyblue3")+theme(axis.text.y = element_blank())+labs(title = "MSE per Model")+geom_text(aes(label = round(as.numeric(MSE),3)), vjust = 2)

Boosting the model will give us the lowest MSE, with PCR in second place. Boosting is the model of choice by far compared to the other models.

F

Which variables appear to be the most important predictors in the boosted model?

summary(boost.Hitters)

                var    rel.inf

CAtBat       CAtBat 21.3776399

CRuns         CRuns 18.4554725

CHits         CHits  9.4230585

CRBI           CRBI  9.1617592

CWalks       CWalks  7.1643822

Years         Years  5.2568153

CHmRun       CHmRun  4.2210637

Walks         Walks  3.0531623

PutOuts     PutOuts  3.0017692

AtBat         AtBat  2.9358935

Hits           Hits  2.8943762

Errors       Errors  2.7801941

RBI             RBI  2.7298938

HmRun         HmRun  2.5411422

Runs           Runs  2.1241977

Assists     Assists  1.5876161

Division   Division  0.4762580

NewLeague NewLeague  0.4306447

League       League  0.3846610

At bat and runs have the most influence.

G

Now apply bagging to the training set. What is the test set MSE for this approach?

set.seed(1)





bag_data <- tibble("predictors" = c(),"ntree" = double(), "MSE" = double())







for(i in seq(1, 19,2)){

  for(x in seq(1, 600,100)){

  bag.hitters <- randomForest(Salary ~., data = Hitters.train, mtry = i,ntree = x)

yhat.bag <- predict(bag.hitters, newdata = Hitters.test)

mse_1 <- (mean((yhat.bag- Hitters.response)^2))

bag_data <-bag_data %>% add_row(predictors = as.character(i),ntree = x, MSE = mse_1)

  }

}



plot<- bag_data %>% ggplot()+geom_line(aes(x = ntree, y = MSE, group = predictors,color = predictors, linetype = predictors))



plotly::ggplotly(plot)

Based on the plot, we should use 13 predictors and 100 trees.

 bag.hitters <- randomForest(Salary ~., data = Hitters.train, mtry = 13,ntree = 100)

yhat.bag <- predict(bag.hitters, newdata = Hitters.test)

paste("The MSE for bagging is : ",(mean((yhat.bag- Hitters.response)^2)))

[1] "The MSE for bagging is :  0.362244453542562"

The .3622 is slightly lower than the boosting MSE of .38.

Final Project IDEAS

Identify a data set that you plan to use for your project/poster and your likely collaborators. What outcome of interest do will you attempt to predict? Why do you expect that the available features (variables), or some subset of them, should help predict this outcome?

For the project, I plan to work alone. If I do stumble on certain problems, I will reach out via Piazza or I will go in for office hours.

For my dataset I went to Kaggle.com. I’m not entirely sure what data set I will use. I found the following data sets that I could possibly use:

For a descrete analysis I could use the Amazon seller dataset from Kagle. This data set is trying to predict whether an amazon order will go through. https://www.kaggle.com/pranalibose/amazon-seller-order-status-prediction This data set was created with the purpose of predicting order sucesses.

I’d also like to possibly use stock data and indicators to predict whether a stock will increase or decrease, and to predict future prices. I know that a lot of people use technical indicators to make buy and sell decisions. I’d like to give that an attempt as well.

I also found another data set that was interesting from Kagle. This data set looks at wind power. This data set is trying to predict how much wind is generated by the windmill in the following 15 days. https://www.kaggle.com/theforcecoder/wind-power-forecasting

Citation

For attribution, please cite this work as

Sullivan (2022, March 13). Project List: Machine Learning Hw 2. Retrieved from https://pjsulliv34.github.io/Blog/posts/machine-learning-hw-2/

BibTeX citation

@misc{sullivan2022machine,
  author = {Sullivan, Peter},
  title = {Project List: Machine Learning Hw 2},
  url = {https://pjsulliv34.github.io/Blog/posts/machine-learning-hw-2/},
  year = {2022}
}